{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [FMA: A Dataset For Music Analysis](https://github.com/mdeff/fma)\n",
    "\n",
    "Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, Xavier Bresson, EPFL LTS2.\n",
    "\n",
    "## Creation\n",
    "\n",
    "From `raw_*.csv`, this notebook generates:\n",
    "* `tracks.csv`: per-track / album / artist metadata.\n",
    "* `genres.csv`: genre hierarchy.\n",
    "* `echonest.csv`: cleaned Echonest features.\n",
    "\n",
    "A companion script, [creation.py](creation.py):\n",
    "1. Query the [API](https://freemusicarchive.org/api) and store metadata in `raw_tracks.csv`, `raw_albums.csv`, `raw_artists.csv` and `raw_genres.csv`.\n",
    "2. Download the audio for each track.\n",
    "3. Trim the audio to 30s clips.\n",
    "4. Normalize the permissions and modification / access times.\n",
    "5. Create the `.zip` archives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import ast\n",
    "import pickle\n",
    "\n",
    "import IPython.display as ipd\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import utils\n",
    "import creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "AUDIO_DIR = os.environ.get('AUDIO_DIR')\n",
    "BASE_DIR = os.path.abspath(os.path.dirname(AUDIO_DIR))\n",
    "FMA_FULL = os.path.join(BASE_DIR, 'fma_full')\n",
    "FMA_LARGE = os.path.join(BASE_DIR, 'fma_large')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 Retrieve metadata and audio from FMA\n",
    "\n",
    "1. Crawl the tracks, albums and artists metadata through their [API](https://freemusicarchive.org/api).\n",
    "2. Download original `.mp3` by HTTPS for each track id (only if we don't have it already).\n",
    "\n",
    "Todo:\n",
    "* Scrap curators.\n",
    "* Download images (`track_image_file`, `album_image_file`, `artist_image_file`). Beware the quality.\n",
    "* Verify checksum for some random tracks.\n",
    "\n",
    "Dataset update:\n",
    "* To add new tracks: iterate from largest known track id to the most recent only.\n",
    "* To update user data: we need to get all tracks again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ./creation.py metadata\n",
    "# ./creation.py data /path/to/fma/fma_full\n",
    "# ./creation.py clips /path/to/fma\n",
    "\n",
    "#!cat creation.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# converters={'genres': ast.literal_eval}\n",
    "tracks = pd.read_csv('raw_tracks.csv', index_col=0)\n",
    "albums = pd.read_csv('raw_albums.csv', index_col=0)\n",
    "artists = pd.read_csv('raw_artists.csv', index_col=0)\n",
    "genres = pd.read_csv('raw_genres.csv', index_col=0)\n",
    "\n",
    "not_found = pickle.load(open('not_found.pickle', 'rb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_fs_tids(audio_dir):\n",
    "    tids = []\n",
    "    for _, dirnames, files in os.walk(audio_dir):\n",
    "        if dirnames == []:\n",
    "            tids.extend(int(file[:-4]) for file in files)\n",
    "    return tids\n",
    "\n",
    "audio_tids = get_fs_tids(FMA_FULL)\n",
    "clips_tids = get_fs_tids(FMA_LARGE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('tracks: {} collected ({} not found, {} max id)'.format(\n",
    "    len(tracks), len(not_found['tracks']), tracks.index.max()))\n",
    "print('albums: {} collected ({} not found, {} in tracks)'.format(\n",
    "    len(albums), len(not_found['albums']), len(tracks['album_id'].unique())))\n",
    "print('artists: {} collected ({} not found, {} in tracks)'.format(\n",
    "    len(artists), len(not_found['artists']), len(tracks['artist_id'].unique())))\n",
    "print('genres: {} collected'.format(len(genres)))\n",
    "print('audio: {} collected ({} not found, {} not in tracks)'.format(\n",
    "    len(audio_tids), len(not_found['audio']), len(set(audio_tids).difference(tracks.index))))\n",
    "print('clips: {} collected ({} not found, {} not in tracks)'.format(\n",
    "    len(clips_tids), len(not_found['clips']), len(set(clips_tids).difference(tracks.index))))\n",
    "assert sum(tracks.index.isin(audio_tids)) + len(not_found['audio']) == len(tracks)\n",
    "assert sum(tracks.index.isin(clips_tids)) + len(not_found['clips']) == sum(tracks.index.isin(audio_tids))\n",
    "assert len(clips_tids) + len(not_found['clips']) + len(not_found['audio']) == len(tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 5\n",
    "ipd.display(tracks.head(N))\n",
    "ipd.display(albums.head(N))\n",
    "ipd.display(artists.head(N))\n",
    "ipd.display(genres.head(N))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 Format metadata\n",
    "\n",
    "Todo:\n",
    "* Sanitize values, e.g. list of words for tags, valid links in `artist_wikipedia_page`, remove html markup in free-form text.\n",
    "    * Clean tags. E.g. some tags are just artist names.\n",
    "* Fill metadata about encoding: length, number of samples, sample rate, bit rate, channels (mono/stereo), 16bits?.\n",
    "* Update duration from audio\n",
    "    * 2624 is marked as 05:05:50 (18350s) although it is reported as 00:21:15.15 by ffmpeg.\n",
    "    * 112067: 3714s --> 01:59:55.06, 112808: 3718s --> 01:59:59.56\n",
    "    * ffmpeg: Estimating duration from bitrate, this may be inaccurate\n",
    "    * Solution, decode the complete mp3: `ffmpeg -i input.mp3 -f null -`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df, column = tracks, 'tags'\n",
    "null = sum(df[column].isnull())\n",
    "print('{} null, {} non-null'.format(null, df.shape[0] - null))\n",
    "df[column].value_counts().head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Tracks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'license_image_file', 'license_image_file_large', 'license_parent_id', 'license_url',  # keep title only\n",
    "    'track_file', 'track_image_file',  # used to download only\n",
    "    'track_url', 'album_url', 'artist_url',  # only relevant on website\n",
    "    'track_copyright_c', 'track_copyright_p',  # present for ~1000 tracks only\n",
    "    # 'track_composer', 'track_lyricist', 'track_publisher',  # present for ~4000, <1000 and <2000 tracks\n",
    "    'track_disc_number',  # different from 1 for <1000 tracks\n",
    "    'track_explicit', 'track_explicit_notes',  # present for <4000 tracks\n",
    "    'track_instrumental'  # ~6000 tracks have a 1, there is an instrumental genre\n",
    "]\n",
    "tracks.drop(drop, axis=1, inplace=True)\n",
    "tracks.rename(columns={'license_title': 'track_license', 'tags': 'track_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks['track_duration'] = tracks['track_duration'].map(creation.convert_duration)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_datetime(df, column, format=None):\n",
    "    df[column] = pd.to_datetime(df[column], infer_datetime_format=True, format=format)\n",
    "convert_datetime(tracks, 'track_date_created')\n",
    "convert_datetime(tracks, 'track_date_recorded')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks['album_id'].fillna(-1, inplace=True)\n",
    "tracks['track_bit_rate'].fillna(-1, inplace=True)\n",
    "tracks = tracks.astype({'album_id': int, 'track_bit_rate': int})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_genres(genres):\n",
    "    genres = ast.literal_eval(genres)\n",
    "    return [int(genre['genre_id']) for genre in genres]\n",
    "\n",
    "tracks['track_genres'].fillna('[]', inplace=True)\n",
    "tracks['track_genres'] = tracks['track_genres'].map(convert_genres)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Albums"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'artist_name', 'album_url', 'artist_url',  # in tracks already (though it can be different)\n",
    "    'album_handle',\n",
    "    'album_image_file', 'album_images',  # todo: shall be downloaded\n",
    "    #'album_producer', 'album_engineer',  # present for ~2400 albums only\n",
    "]\n",
    "albums.drop(drop, axis=1, inplace=True)\n",
    "albums.rename(columns={'tags': 'album_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_datetime(albums, 'album_date_created')\n",
    "convert_datetime(albums, 'album_date_released')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "albums.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Artists"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "drop = [\n",
    "    'artist_website', 'artist_url',  # in tracks already (though it can be different)\n",
    "    'artist_handle',\n",
    "    'artist_image_file', 'artist_images',  # todo: shall be downloaded\n",
    "    'artist_donation_url', 'artist_paypal_name', 'artist_flattr_name',  # ~1600 & ~400 & ~70, not relevant\n",
    "    'artist_contact',  # ~1500, not very useful data\n",
    "    # 'artist_active_year_begin', 'artist_active_year_end',  # ~1400, ~500 only\n",
    "    # 'artist_associated_labels',  # ~1000\n",
    "    # 'artist_related_projects',  # only ~800, but can be combined with bio\n",
    "]\n",
    "artists.drop(drop, axis=1, inplace=True)\n",
    "artists.rename(columns={'tags': 'artist_tags'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "convert_datetime(artists, 'artist_date_created')\n",
    "for column in ['artist_active_year_begin', 'artist_active_year_end']:\n",
    "    artists[column].replace(0.0, np.nan, inplace=True)\n",
    "    convert_datetime(artists, column, format='%Y.0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "artists.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Merge DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "not_found['albums'].remove(None)\n",
    "not_found['albums'].append(-1)\n",
    "not_found['albums'] = [int(i) for i in not_found['albums']]\n",
    "not_found['artists'] = [int(i) for i in not_found['artists']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks = tracks.merge(albums, left_on='album_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))\n",
    "\n",
    "n = sum(tracks['album_title_dup'].isnull())\n",
    "print('{} tracks without extended album information ({} tracks without album_id)'.format(\n",
    "    n, sum(tracks['album_id'] == -1)))\n",
    "assert sum(tracks['album_id'].isin(not_found['albums'])) == n\n",
    "assert sum(tracks['album_title'] != tracks['album_title_dup']) == n\n",
    "\n",
    "tracks.drop('album_title_dup', axis=1, inplace=True)\n",
    "assert not any('dup' in col for col in tracks.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Album artist can be different than track artist. Keep track artist.\n",
    "#tracks[tracks['artist_name'] != tracks['artist_name_dup']].select(lambda x: 'artist_name' in x, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks = tracks.merge(artists, left_on='artist_id', right_index=True, sort=False, how='left', suffixes=('', '_dup'))\n",
    "\n",
    "n = sum(tracks['artist_name_dup'].isnull())\n",
    "print('{} tracks without extended artist information'.format(n))\n",
    "assert sum(tracks['artist_id'].isin(not_found['artists'])) == n\n",
    "assert sum(tracks['artist_name'] != tracks[('artist_name_dup')]) == n\n",
    "\n",
    "tracks.drop('artist_name_dup', axis=1, inplace=True)\n",
    "assert not any('dup' in col for col in tracks.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = []\n",
    "for name in tracks.columns:\n",
    "    names = name.split('_')\n",
    "    columns.append((names[0], '_'.join(names[1:])))\n",
    "tracks.columns = pd.MultiIndex.from_tuples(columns)\n",
    "assert all(label in ['track', 'album', 'artist'] for label in tracks.columns.get_level_values(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Todo: fill other columns ?\n",
    "tracks['album', 'tags'].fillna('[]', inplace=True)\n",
    "tracks['artist', 'tags'].fillna('[]', inplace=True)\n",
    "\n",
    "columns = [('album', 'favorites'), ('album', 'comments'), ('album', 'listens'), ('album', 'tracks'),\n",
    "           ('artist', 'favorites'), ('artist', 'comments')]\n",
    "for column in columns:\n",
    "    tracks[column].fillna(-1, inplace=True)\n",
    "columns = {column: int for column in columns}\n",
    "tracks = tracks.astype(columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 Data cleaning\n",
    "\n",
    "Todo: duplicates (metadata and audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def keep(index, df):\n",
    "    old = len(df)\n",
    "    df = df.loc[index]\n",
    "    new = len(df)\n",
    "    print('{} lost, {} left'.format(old - new, new))\n",
    "    return df\n",
    "\n",
    "tracks = keep(tracks.index, tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Audio not found or could not be trimmed.\n",
    "tracks = keep(tracks.index.difference(not_found['audio']), tracks)\n",
    "tracks = keep(tracks.index.difference(not_found['clips']), tracks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Errors from the `features.py` script.\n",
    "* IndexError('index 0 is out of bounds for axis 0 with size 0',)\n",
    "    * ffmpeg: Header missing\n",
    "    * ffmpeg: Could not find codec parameters for stream 0 (Audio: mp3, 0 channels, s16p): unspecified frame size. Consider increasing the value for the 'analyzeduration' and 'probesize' options\n",
    "    * tids: 117759\n",
    "* NoBackendError()\n",
    "    * ffmpeg: Format mp3 detected only with low score of 1, misdetection possible!\n",
    "    * tids: 80015, 115235\n",
    "* UserWarning('Trying to estimate tuning from empty frequency set.',)\n",
    "    * librosa error\n",
    "    * tids: 1440, 26436, 38903, 57603, 62095, 62954, 62956, 62957, 62959, 62971, 86079, 96426, 104623, 106719, 109714, 114501, 114528, 118003, 118004, 127827, 130298, 130296, 131076, 135804, 154923\n",
    "* ParameterError('Filter pass-band lies beyond Nyquist',)\n",
    "    * librosa error\n",
    "    * tids: 152204, 28106, 29166, 29167, 29169, 29168, 29170, 29171, 29172, 29173, 29179, 43903, 56757, 59361, 75461, 92346, 92345, 92347, 92349, 92350, 92351, 92353, 92348, 92352, 92354, 92355, 92356, 92358, 92359, 92361, 92360, 114448, 136486, 144769, 144770, 144771, 144773, 144774, 144775, 144778, 144776, 144777"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature extraction failed.\n",
    "FAILED = [1440, 26436, 28106, 29166, 29167, 29168, 29169, 29170, 29171, 29172,\n",
    "          29173, 29179, 38903, 43903, 56757, 57603, 59361, 62095, 62954, 62956,\n",
    "          62957, 62959, 62971, 75461, 80015, 86079, 92345, 92346, 92347, 92348,\n",
    "          92349, 92350, 92351, 92352, 92353, 92354, 92355, 92356, 92357, 92358,\n",
    "          92359, 92360, 92361, 96426, 104623, 106719, 109714, 114448, 114501,114528,\n",
    "          115235, 117759, 118003, 118004, 127827, 130296, 130298, 131076, 135804, 136486,\n",
    "          144769, 144770, 144771, 144773, 144774, 144775, 144776, 144777, 144778, 152204,\n",
    "          154923]\n",
    "tracks = keep(tracks.index.difference(FAILED), tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# License forbids redistribution.\n",
    "tracks = keep(tracks['track', 'license'] != 'FMA-Limited: Download Only', tracks)\n",
    "print('{} licenses'.format(len(tracks[('track', 'license')].unique())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#sum(tracks['track', 'title'].duplicated())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4 Genres"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "genres.drop(['genre_handle', 'genre_color'], axis=1, inplace=True)\n",
    "genres.rename(columns={'genre_parent_id': 'parent', 'genre_title': 'title'}, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "genres['parent'].fillna(0, inplace=True)\n",
    "genres = genres.astype({'parent': int})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 13 (Easy Listening) has parent 126 which is missing\n",
    "# --> a root genre on the website, although not in the genre menu\n",
    "genres.at[13, 'parent'] = 0\n",
    "\n",
    "# 580 (Abstract Hip-Hop) has parent 1172 which is missing\n",
    "# --> listed as child of Hip-Hop on the website\n",
    "genres.at[580, 'parent'] = 21\n",
    "\n",
    "# 810 (Nu-Jazz) has parent 51 which is missing\n",
    "# --> listed as child of Easy Listening on website\n",
    "genres.at[810, 'parent'] = 13\n",
    "\n",
    "# 763 (Holiday) has parent 763 which is itself\n",
    "# --> listed as child of Sound Effects on website\n",
    "genres.at[763, 'parent'] = 16\n",
    "\n",
    "# Todo: should novelty be under Experimental? It is alone on website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Genre 806 (hiphop) should not exist. Replace it by 21 (Hip-Hop).\n",
    "print('{} tracks have genre 806'.format(\n",
    "    sum(tracks['track', 'genres'].map(lambda genres: 806 in genres))))\n",
    "def change_genre(genres):\n",
    "    return [genre if genre != 806 else 21 for genre in genres]\n",
    "tracks['track', 'genres'] = tracks['track', 'genres'].map(change_genre)\n",
    "genres.drop(806, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_parent(genre, track_all_genres=None):\n",
    "    parent = genres.at[genre, 'parent']\n",
    "    if track_all_genres is not None:\n",
    "        track_all_genres.append(genre)\n",
    "    return genre if parent == 0 else get_parent(parent, track_all_genres)\n",
    "\n",
    "# Get all genres, i.e. all genres encountered when walking from leafs to roots.\n",
    "def get_all_genres(track_genres):\n",
    "    track_all_genres = list()\n",
    "    for genre in track_genres:\n",
    "        get_parent(genre, track_all_genres)\n",
    "    return list(set(track_all_genres))\n",
    "\n",
    "tracks['track', 'genres_all'] = tracks['track', 'genres'].map(get_all_genres)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of tracks per genre.\n",
    "def count_genres(subset=tracks.index):\n",
    "    count = pd.Series(0, index=genres.index)\n",
    "    for _, track_all_genres in tracks.loc[subset, ('track', 'genres_all')].items():\n",
    "        for genre in track_all_genres:\n",
    "            count[genre] += 1\n",
    "    return count\n",
    "\n",
    "genres['#tracks'] = count_genres()\n",
    "genres[genres['#tracks'] == 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_genre(track_genres):\n",
    "    top_genres = set(genres.at[genres.at[genre, 'top_level'], 'title'] for genre in track_genres)\n",
    "    return top_genres.pop() if len(top_genres) == 1 else np.nan\n",
    "\n",
    "# Top-level genre.\n",
    "genres['top_level'] = genres.index.map(get_parent)\n",
    "tracks['track', 'genre_top'] = tracks['track', 'genres'].map(get_top_genre)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "genres.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 Subsets: large, medium, small"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Large\n",
    "\n",
    "Main characteristic: the full set with clips trimmed to a manageable size."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Medium\n",
    "\n",
    "Main characteristic: clean metadata (includes 1 top-level genre) and quality audio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fma_medium = pd.DataFrame(tracks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Missing meta-information.\n",
    "\n",
    "# Missing extended album and artist information.\n",
    "fma_medium = keep(~fma_medium['album', 'id'].isin(not_found['albums']), fma_medium)\n",
    "fma_medium = keep(~fma_medium['artist', 'id'].isin(not_found['artists']), fma_medium)\n",
    "\n",
    "# Untitled track or album.\n",
    "fma_medium = keep(~fma_medium['track', 'title'].isnull(), fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'title'].map(lambda x: 'untitled' in x.lower()) == False, fma_medium)\n",
    "\n",
    "# One tag is often just the artist name. Tags too scarce for tracks and albums.\n",
    "#keep(fma_medium['artist', 'tags'].map(len) >= 2, fma_medium)\n",
    "\n",
    "# Too scarce.\n",
    "#fma_medium = keep(~fma_medium['album', 'information'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'bio'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'website'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'wikipedia_page'].isnull(), fma_medium)\n",
    "\n",
    "# Too scarce.\n",
    "#fma_medium = keep(~fma_medium['artist', 'location'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'latitude'].isnull(), fma_medium)\n",
    "#fma_medium = keep(~fma_medium['artist', 'longitude'].isnull(), fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Technical quality.\n",
    "# Todo: sample rate\n",
    "fma_medium = keep(fma_medium['track', 'bit_rate'] > 100000, fma_medium)\n",
    "\n",
    "# Choosing standard bit rates discards all VBR.\n",
    "#fma_medium = keep(fma_medium['track', 'bit_rate'].isin([320000, 256000, 192000, 160000, 128000]), fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fma_medium = keep(fma_medium['track', 'duration'] >= 60, fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'duration'] <= 600, fma_medium)\n",
    "\n",
    "fma_medium = keep(fma_medium['album', 'tracks'] >= 1, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'tracks'] <= 50, fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Lower popularity bound.\n",
    "fma_medium = keep(fma_medium['track', 'listens'] >= 100, fma_medium)\n",
    "fma_medium = keep(fma_medium['track', 'interest'] >= 200, fma_medium)\n",
    "fma_medium = keep(fma_medium['album', 'listens'] >= 1000, fma_medium);\n",
    "\n",
    "# Favorites and comments are very scarce.\n",
    "#fma_medium = keep(fma_medium['artist', 'favorites'] >= 1, fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Targeted genre classification.\n",
    "fma_medium = keep(~fma_medium['track', 'genre_top'].isnull(), fma_medium);\n",
    "#keep(fma_medium['track', 'genres'].map(len) == 1, fma_medium);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Adjust size with popularity measure. Should be of better quality.\n",
    "N_TRACKS = 25000\n",
    "\n",
    "# Observations\n",
    "# * More albums killed than artists --> be sure not to kill diversity\n",
    "# * Favorites and preterites genres differently --> do it per genre?\n",
    "# Normalization\n",
    "# * mean, median, std, max\n",
    "# * tracks per album or artist\n",
    "# Test\n",
    "# * 4/5 of same tracks were selected with various set of measures\n",
    "# * <5% diff with max and mean\n",
    "\n",
    "popularity_measures = [('track', 'listens'), ('track', 'interest')]  # ('album', 'listens')\n",
    "# ('track', 'favorites'), ('track', 'comments'),\n",
    "# ('album', 'favorites'), ('album', 'comments'),\n",
    "# ('artist', 'favorites'), ('artist', 'comments'),\n",
    "\n",
    "normalization = {measure: fma_medium[measure].max() for measure in popularity_measures}\n",
    "def popularity_measure(track):\n",
    "    return sum(track[measure] / normalization[measure] for measure in popularity_measures)\n",
    "fma_medium['popularity_measure'] = fma_medium.apply(popularity_measure, axis=1)\n",
    "fma_medium = keep(fma_medium.sort_values('popularity_measure', ascending=False).index[:N_TRACKS], fma_medium)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmp = genres[genres['parent'] == 0].reset_index().set_index('title')\n",
    "tmp['#tracks_medium'] = fma_medium['track', 'genre_top'].value_counts()\n",
    "tmp.sort_values('#tracks_medium', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Small\n",
    "\n",
    "Main characteristic: genre balanced (and echonest features).\n",
    "\n",
    "Choices:\n",
    "* 8 genres with 1000 tracks --> 8,000 tracks\n",
    "* 10 genres with 500 tracks --> 5,000 tracks\n",
    "\n",
    "Todo:\n",
    "* Download more echonest features so that all tracks can have them. Otherwise intersection of tracks with echonest features and one top-level genre is too small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N_GENRES = 8\n",
    "N_TRACKS = 1000\n",
    "\n",
    "top_genres = tmp.sort_values('#tracks_medium', ascending=False)[:N_GENRES].index\n",
    "fma_small = pd.DataFrame(fma_medium)\n",
    "fma_small = keep(fma_small['track', 'genre_top'].isin(top_genres), fma_small)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "to_keep = []\n",
    "for genre in top_genres:\n",
    "    subset = fma_small[fma_small['track', 'genre_top'] == genre]\n",
    "    drop = subset.sort_values('popularity_measure').index[:-N_TRACKS]\n",
    "    fma_small.drop(drop, inplace=True)\n",
    "assert len(fma_small) == N_GENRES * N_TRACKS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.4 Subset indication"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SUBSETS = ('small', 'medium', 'large')\n",
    "tracks['set', 'subset'] = pd.Series().astype('category', categories=SUBSETS, ordered=True)\n",
    "tracks.loc[tracks.index, ('set', 'subset')] = 'large'\n",
    "tracks.loc[fma_medium.index, ('set', 'subset')] = 'medium'\n",
    "tracks.loc[fma_small.index, ('set', 'subset')] = 'small'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.5 Echonest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "echonest = pd.read_csv('raw_echonest.csv', index_col=0, header=[0, 1, 2])\n",
    "echonest = keep(~echonest['echonest', 'temporal_features'].isnull().any(axis=1), echonest)\n",
    "echonest = keep(~echonest['echonest', 'audio_features'].isnull().any(axis=1), echonest)\n",
    "echonest = keep(~echonest['echonest', 'social_features'].isnull().any(axis=1), echonest)\n",
    "\n",
    "echonest = keep(echonest.index.isin(tracks.index), echonest);\n",
    "keep(echonest.index.isin(fma_medium.index), echonest);\n",
    "keep(echonest.index.isin(fma_small.index), echonest);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6 Splits: training, validation, test\n",
    "\n",
    "Take into account:\n",
    "* Artists may only appear on one side.\n",
    "* Stratification: ideally, all characteristics (#tracks per artist, duration, sampling rate, information, bio) and targets (genres, tags) should be equally distributed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for genre in genres.index:\n",
    "    tracks['genre', genres.at[genre, 'title']] = tracks['track', 'genres_all'].map(lambda genres: genre in genres)\n",
    "\n",
    "SPLITS = ('training', 'test', 'validation')\n",
    "PERCENTAGES = (0.8, 0.1, 0.1)\n",
    "tracks['set', 'split'] = pd.Series().astype('category', categories=SPLITS)\n",
    "\n",
    "for subset in SUBSETS:\n",
    "\n",
    "    tracks_subset = tracks['set', 'subset'] <= subset\n",
    "\n",
    "    # Consider only top-level genres for small and medium.\n",
    "    genre_list = list(tracks.loc[tracks_subset, ('track', 'genre_top')].unique())\n",
    "    if subset == 'large':\n",
    "        genre_list = list(genres['title']) \n",
    "\n",
    "    while True:\n",
    "        if len(genre_list) == 0:\n",
    "            break\n",
    "\n",
    "        # Choose most constrained genre, i.e. genre with the least unassigned artists.\n",
    "        tracks_unsplit = tracks['set', 'split'].isnull()\n",
    "        count = tracks[tracks_subset & tracks_unsplit].set_index(('artist', 'id'), append=True)['genre']\n",
    "        count = count.groupby(level=1).sum().astype(np.bool).sum()\n",
    "        genre = np.argmin(count[genre_list])\n",
    "        genre_list.remove(genre)\n",
    "        \n",
    "        # Given genre, select artists.\n",
    "        tracks_genre = tracks['genre', genre] == 1\n",
    "        artists = tracks.loc[tracks_genre & tracks_subset & tracks_unsplit, ('artist', 'id')].value_counts()\n",
    "        #print('-->', genre, len(artists))\n",
    "\n",
    "        current = {split: np.sum(tracks_genre & tracks_subset & (tracks['set', 'split'] == split)) for split in SPLITS}\n",
    "\n",
    "        # Assign artists with most tracks first.\n",
    "        for artist, count in artists.items():\n",
    "            choice = np.argmin([current[split] / percentage for split, percentage in zip(SPLITS, PERCENTAGES)])\n",
    "            current[SPLITS[choice]] += count\n",
    "            #assert tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')].isnull().all()\n",
    "            tracks.loc[tracks['artist', 'id'] == artist, ('set', 'split')] = SPLITS[choice]\n",
    "\n",
    "# Tracks without genre can only serve as unlabeled data for training, e.g. for semi-supervised algorithms.\n",
    "no_genres = tracks['track', 'genres_all'].map(lambda genres: len(genres) == 0)\n",
    "no_split = tracks['set', 'split'].isnull()\n",
    "assert not (no_split & ~no_genres).any()\n",
    "tracks.loc[no_split, ('set', 'split')] = 'training'\n",
    "\n",
    "# Not needed any more.\n",
    "tracks.drop('genre', axis=1, level=0, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7 Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for dataset in 'tracks', 'genres', 'echonest':\n",
    "    eval(dataset).sort_index(axis=0, inplace=True)\n",
    "    eval(dataset).sort_index(axis=1, inplace=True)\n",
    "    params = dict(float_format='%.10f') if dataset == 'echonest' else dict()\n",
    "    eval(dataset).to_csv(dataset + '.csv', **params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ./creation.py normalize /path/to/fma\n",
    "# ./creation.py zips /path/to/fma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8 Description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracks = utils.load('tracks.csv')\n",
    "tracks.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 5\n",
    "ipd.display(tracks['track'].head(N))\n",
    "ipd.display(tracks['album'].head(N))\n",
    "ipd.display(tracks['artist'].head(N))"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}
